• getLastMessages( viewer, chats )
• add( chat, message )
• search( viewer, text )
• indexMessages()

CREATE TABLE Messages (
    chatId, msgId,
    user, type, text, attachments[], terminal, deletedBy[], replyTo, ...
    PRIMARY KEY ( chatId, msgId )
)

Microservice Architecture

• Application Logic
• Application State (data)


Microservice Architecture 


• getMessages( viewer, chat, from, to )

SELECT * FROM Messages
WHERE chatId = ? AND
      msgId BETWEEN :from AND :to


Microservice Architecture 


• getMessages( viewer, chat, from, to ): 100 k / sec
  5% of chats make 95% of requests
• getLastMessages( viewer, chats )
  100% of chats, < 1% of requests
• indexMessages()
Microservices: costs

• CPU: (un)marshalling
• Overreads, overwrites
  reads and writes per request
• Network latency and traffic
Microservices: lowering costs

• CPU: (un)marshalling: KV-Direct
• Overreads, overwrites: redis, tarantool, memcache
• Network latency and traffic: NetCache

http://redis.io
https://tarantool.io
X. Jin et al. NetCache: Balancing key-value stores with fast in-network caching. In SOSP, 2017.
B. Li et al. KV-Direct: High-performance in-memory key-value store with programmable NIC. In SOSP, 2017.


Stateful Microservices

Stateful Microservices

• (Un)marshalling
• Overreads, overwrites
• Network latency and traffic

• Application Logic
• Custom in-memory store
  application specific
• Embedded distributed store code
  only the code is embedded, operates just like a dedicated node
More efficient. Ever.
What will go wrong?

1. Client crash
2. Server crash
3. Request omission
4. Response omission
5. Server timeout
6. Invalid value response
7. Arbitrary failure

Distributed systems at OK.ru
Oleg Anastasyev, Armenian C++ Community Meetup #7 @ ISTC, on 17.12.2022 


ntto://tme/coparm 


Failures 


stateful service 
Downtime

Failure probability of a single machine: $P(K) = p$

Probability that at least one of $N$ machines is down: $P(\exists K) = 1 - (1 - p)^N$

[Diagram: memcache, stateful service]
Downtime probabilities

p = 0.1

P(K) = 0.271
P(K) = 0.007
P(K) = 0.271
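The 0.271 figure can be reproduced from the formula for "at least one of N machines is down". The sketch below assumes independent failures and N = 3; the value of N is inferred from the numbers on the slide, not stated there, and the class and method names are illustrative only.

```java
import java.util.Locale;

/** Sketch: downtime probability of a tier that needs all of its machines up. */
final class Downtime {
    /** P(at least one of n machines is down) = 1 - (1 - p)^n, assuming independent failures. */
    static double anyOfFails(double p, int n) {
        return 1.0 - Math.pow(1.0 - p, n);
    }

    public static void main(String[] args) {
        // p = 0.1 as on the slide; n = 3 is an assumption that matches 0.271
        System.out.println(String.format(Locale.ROOT, "%.3f", anyOfFails(0.1, 3)));
    }
}
```

Every extra machine that a request depends on raises this probability, which is the argument for removing separate cache and database tiers from the request path.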




More reliable 


stateful service 
implementation


Embedding the Database

Requirements are:

• Always available
  Replication, consistency
• Scalable
  Re-sharding
• Application language
  Minimal (un)marshalling, integration with the application
• Open source
  So we can code something crazy
Embedding the Database

1. -cp cassandra/lib/*.jar

2. System.setProperty( "cassandra.config", "file://whatever/cassandra.yaml" );
   CassandraDaemon.instance.activate();

package org.apache.cassandra.service;

public class CassandraDaemon
{
Request routing

• Partition-aware client routing library
  Routes a request to the replica owning the data, based on the key specified in the request and the cluster topology information
Data partitioning

• Partition Key ( chatId )
  Defines which node owns a row
• Clustering Key ( msgId )
  Defines an order of rows within a partition

CREATE TABLE Messages (
    chatId, msgId,
    user, type, text, attachments[], terminal, deletedBy[], replyTo, ...
    PRIMARY KEY ( chatId, msgId )
)
Data partitioning

• Partitioner
  Calculates token and position on ring
• TokenMetadata
  Maps range of tokens to primary nodes
• Replication Strategy
  Defines replica placement


token = hash( partition key ) 
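A minimal sketch of how a partition-aware router can turn a token into a replica list. `TokenRing`, its methods and the toy hash are hypothetical and are not Cassandra's API; the only assumption carried over from the slides is that a sorted map of tokens to endpoints describes the ring.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of a token ring lookup (hypothetical names, not Cassandra classes). */
final class TokenRing {
    // token -> replicas owning the range that ends at this token
    private final TreeMap<Long, List<String>> ring = new TreeMap<>();

    void addRange(long token, List<String> replicas) {
        ring.put(token, replicas);
    }

    /** Toy hash for illustration only; a real partitioner uses a proper hash function. */
    static long token(long partitionKey) {
        return partitionKey * 0x9E3779B97F4A7C15L;
    }

    /** Finds the first ring position at or after the token, wrapping around to the start. */
    List<String> replicasForToken(long token) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("empty ring");
        }
        Map.Entry<Long, List<String>> e = ring.ceilingEntry(token);
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    /** Routes by partition key, e.g. chatId. */
    List<String> replicasFor(long partitionKey) {
        return replicasForToken(token(partitionKey));
    }
}
```

Because the lookup is a local map operation, the client can send each request straight to a node that owns the chat, provided its copy of the topology is reasonably fresh.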
Data partitioning

SortedMap<Token, List<InetAddress>> endpointMap = ...
AbstractReplicationStrategy replication = ...

for ( Token token : tokenMetadata.sortedTokens() ) {
    endpointMap.put( token, replication.getNaturalEndpoints( token ) );
}

• Topology changes over time
  Refreshes and dealing with stale topology
Messenger: calling the DB

CREATE TABLE Messages (
    chatId, msgId,
    user, type, text, attachments[], terminal, deletedBy[], replyTo, ...
    PRIMARY KEY ( chatId, msgId )
)

• getMessages( viewer, chat, from, to )

Quick start

Here's a short program that connects to Cassandra and executes a query:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.*;

try (CqlSession session = CqlSession.builder().build()) { // (1)
    ResultSet rs = session.execute("select release_version from system.local"); // (2)
    Row row = rs.one();
    System.out.println(row.getString("release_version")); // (3)
}

https://github.com/datastax/java-driver/tree/4.x/manual/core


Messenger: calling the DB

• getMessages( viewer, chat, from, to )
• add( chat, message )

package org.apache.cassandra.cql3;

import java.nio.ByteBuffer;

public class QueryProcessor
{
    public static UntypedResultSet execute(String query,
                                           ConsistencyLevel cl, Object... values)
    throws RequestExecutionException

// in-process call: no driver, no network hop
UntypedResultSet rs = QueryProcessor.execute(
    "SELECT * FROM Messages "
    + "WHERE chatId = ? AND msgId >= ? AND msgId <= ?",
    ConsistencyLevel.QUORUM, chatId, from, to );

rs.forEach( row -> {} );
Messages In-Memory Store

            all data       in-memory stored data
messages    600 billion    3+ billion
chats       5 billion      250 million
size        100 TB         500 GB

5% of chats are active
80%: freshest 13 messages


Messages In-Memory Store

In-memory store: 3+ billion messages, 250 million chats, 500 GB

• getLastMessages( viewer, chats )
• getMessages( viewer, chat, from, to )
• add( chat, message )


Messages In-Memory Store: getMessages

[Diagram: getMessages does a get on the in-memory store; on a miss, QueryProcessor.execute loads "14:30 Hello, Luke!" and "14:35 Wazzup!", a put stores them in memory, and they are returned.]

Messages In-Memory Store

[Diagram: the same getMessages flow on two replicas; each runs QueryProcessor.execute, puts "14:30 Hello, Luke!" and "14:35 Wazzup!" into its own in-memory store, and returns them.]
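The get / load / put / return flow is a read-through cache. A minimal sketch follows, assuming messages are plain strings and the database call is passed in as a function; `MessageCache` and `loader` are hypothetical names, with `loader` standing in for the in-process QueryProcessor.execute call.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Sketch of a read-through in-memory store keyed by chat id. */
final class MessageCache {
    private final Map<Long, List<String>> byChat = new ConcurrentHashMap<>();
    // Stands in for the database read; hypothetical, not a Cassandra type.
    private final Function<Long, List<String>> loader;

    MessageCache(Function<Long, List<String>> loader) {
        this.loader = loader;
    }

    /** Returns cached messages, loading and caching them on the first access. */
    List<String> getMessages(long chatId) {
        return byChat.computeIfAbsent(chatId, loader);
    }

    /** Number of chats currently held in memory. */
    int cachedChats() {
        return byChat.size();
    }
}
```

Only chats that are actually requested end up in memory, which matches the observation that a small share of chats is active.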


In-memory store: freshness

[Diagram: one replica has cached "14:30 Hello, Luke!" and "14:35 Wazzup!" through getMessages; add on another replica runs an INSERT of "15:00 Lunch?", so the database holds three messages while the cached copy still holds two.]
In-memory store: freshness

[Diagram: add runs an INSERT of "15:00 Lunch?"; added -> fail.]

In-memory store: freshness

• add, INSERT: Mutation
• Save: Hint: Mutation ( Hint )
• Read Repair: Mutation ( Read Repair )
• Streaming, Repair: SSTable Stream
In-memory store: freshness

[Diagram: add runs an INSERT; Mutation, Hint and Stream events also update the in-memory store, which then holds "14:30 Hello, Luke!", "14:35 Wazzup!" and "15:00 Lunch?".]
Mutation: individual rows

interface ApplyMutationListener
{
    void onApply(ByteBuffer key,
                 DeletionTime deletion,
                 Iterator<Unfiltered> atoms);
}

package org.apache.cassandra.db;

public class Keyspace
{
    public void apply( Mutation mutation,
                       boolean writeCommitLog,
                       boolean updateIndexes,
                       boolean isDroppable )

Streaming: sstables

package org.apache.cassandra.streaming;

public class StreamReceiveTask extends StreamTask
{
    // holds references to SSTables received
    protected Collection<SSTableReader> sstables;

    private static class OnCompletionRunnable implements Runnable {
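The listener idea can be sketched without Cassandra types. Everything below is a simplified assumption: `RowListener` replaces the slide's ApplyMutationListener (which receives a ByteBuffer key, a DeletionTime and an iterator of atoms), and `WritePath` stands in for the point in the database where every mutation is applied, whether it comes from a write, a hint or a read repair.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical, simplified listener: one appended message per call. */
interface RowListener {
    void onApply(long chatId, String message);
}

/** Stand-in for the embedded database's apply path. */
final class WritePath {
    private final List<RowListener> listeners = new CopyOnWriteArrayList<>();

    void register(RowListener listener) {
        listeners.add(listener);
    }

    /** Invoked for every applied mutation, regardless of where it came from. */
    void apply(long chatId, String message) {
        // a real store would persist the row here before notifying
        for (RowListener listener : listeners) {
            listener.onApply(chatId, message);
        }
    }
}

/** In-memory store that stays fresh by listening to applied mutations. */
final class FreshCache implements RowListener {
    private final Map<Long, List<String>> byChat = new ConcurrentHashMap<>();

    void load(long chatId, List<String> messages) {
        byChat.put(chatId, new ArrayList<>(messages));
    }

    @Override
    public void onApply(long chatId, String message) {
        // Only chats that are already cached are updated; others stay out of memory.
        byChat.computeIfPresent(chatId, (id, messages) -> {
            List<String> copy = new ArrayList<>(messages);
            copy.add(message);
            return copy;
        });
    }

    List<String> get(long chatId) {
        return byChat.get(chatId);
    }
}
```

Hooking the cache at the apply point rather than at the add call means it also sees data that arrives later through hints or repair.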
Messages In-Memory Store: state loss

[Diagram: a node holding "14:30 Hello, Luke!" and "14:35 Wazzup!" in memory crashes and reboots; add runs an INSERT of "15:00 Lunch?", which reaches the other replicas as a Mutation and the rebooted node as Mutation ( Hint ); getMessages on the rebooted node has to read the messages again.]
Messages In-Memory Store: state loss

• write
• delete
• normal restart: isEmpty ? load

CREATE KEYSPACE Caches
WITH REPLICATION = {
    'class': 'LocalStrategy'
};

CREATE TABLE Caches.MessagesSnapshot (
    rowkey blob,
    value blob,
    PRIMARY KEY ( rowkey )
);

SELECT * FROM MessagesSnapshot
In-memory store: optimizing normal restarts

• Shared Memory
  https://github.com/odnoklassniki/one-nio
  one.nio.mem.SharedMemoryMap
• /dev/shm/msgs-cache.mem
  sometimes, not always
• tmpfs
• hugetlbfs
  4k pages -> 2M, 1G
In-memory store: waiting for consistency

[Diagram: a node restarts while add runs an INSERT elsewhere; the Mutation it missed is delivered later as Mutation ( Hint ), so getMessages on the restarted node can be served before "14:35 Wazzup!" and "15:00 Lunch?" have both arrived.]


In-memory store: waiting for consistency

[Diagram: after a restart the node asks its peers "Have undelivered hints for me?"; it serves getMessages once the answer is "none".]
In-memory store: summary

• Shares process with the app and the database
  Avoids marshalling and network costs, overreads
• Data freshness problem
  Extended Cassandra with listeners
• State loss problem
  Use local tables and shared memory
• Consistent
  As much as the database


getLastMessages( chats[] )

• Multiple chats in request
  No single node owns all data
• Fraction of them are in cache
  Meaningless to load inactive chats to cache
memcaches

Map<Long, Message> getLastMessages( long[] chatIds )


split & merge

[Diagram: the request is split by owning node into getLastMessages(A), getLastMessages(B) and getLastMessages(C); each node maps its own chats, and the partial results are combined with merge(A,B,C).]
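A minimal sketch of split and merge, assuming a function that tells which node owns a chat. `SplitMerge`, `Node` and `ownerOf` are hypothetical names, and the per-node calls run sequentially here for brevity, whereas a real router would most likely issue them in parallel.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongFunction;

/** Sketch: fan a multi-chat request out to owning nodes and merge the answers. */
final class SplitMerge {
    /** Hypothetical per-node call returning the last message of each chat it owns. */
    interface Node {
        Map<Long, String> getLastMessages(List<Long> chatIds);
    }

    static Map<Long, String> getLastMessages(long[] chatIds, LongFunction<Node> ownerOf) {
        // Split: group the requested chat ids by the node that owns them.
        Map<Node, List<Long>> perNode = new HashMap<>();
        for (long id : chatIds) {
            perNode.computeIfAbsent(ownerOf.apply(id), n -> new ArrayList<>()).add(id);
        }
        // Merge: one sub-request per node, results combined into a single map.
        Map<Long, String> merged = new HashMap<>();
        for (Map.Entry<Node, List<Long>> e : perNode.entrySet()) {
            merged.putAll(e.getKey().getLastMessages(e.getValue()));
        }
        return merged;
    }
}
```

Each node answers only for the chats it owns, so no node has to load chats that belong elsewhere.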
Full Text Search: search and indexMessages

• Inverted index today
  Work in progress
  lucene.apache.org
• One per conversation
  single 100TB index does not work;
  per-user duplicates data
• Large conversations only
  Short chats index builds right before search


indexMessages

add → INSERT → IndexWriter.addDocument → IndexWriter.commit

• Commits are too often
• Too many open indexes
Compaction

• Merges data generations
  Across sstable files
• In defined order
  token(PartitioningKey), Clustering Key

http://cassandra.apache.org/doc/latest/operating/compaction.html

package org.apache.cassandra.db.compaction;

public class CompactionManager implements CompactionManagerMBean
{

App to DB tight integration = even less costs
Operations

Longer deployments

What takes time:
• DB initialization
  Depends on the disk speed and the volume of WAL to play
• In-memory lost state load
  Cache size, contention, CPU
• Consistency wait time
  Number of lost mutations

How to mitigate:
• Parallel deployment of all availability zone instances

[Diagram: AZ 1, AZ 2, AZ 3]

More DB node restarts

Why:
• DB is embedded into app
  app restart == DB node restart

Not a problem:
• Nothing, this is good
  Makes it easy to debug failures in controlled environment;
  Chaos monkey on a regular basis at no cost


Longer scaleout time

Why:
• State is colocated with application
  Scaleout includes data resharding

How to mitigate:
• Capacity planning
  Scaleout upfront
• Feature flags in right places
Imbalanced resource utilization

Why:
• State is colocated with an application
  Number of app, cache and db are equal

How to mitigate:
• Container orchestration
  one-cloud, k8s, aurora, mesos

One-cloud: the datacenter OS of ok.ru (Russian)
Oleg Anastasyev, 2018
Diagnostics and support
https://github.com/jvm-profiling-tools/async-profiler

[Flame graph frames:
com/sun/proxy/$Proxy42.visitItemsLocal
one/list/storage/server/ListStorageServiceImpl.visitItems
one/list/storage/server/ListStorageServiceImpl.getItemsNoLogging
one/list/storage/server/ListStorageServiceImpl.getItems]


Stateful Services Summary

• More effective and reliable
  Avoid marshalling, overreads and network costs
• Caches are now consistent with data
  Cache and C* are embedded in the single process
• Not so hard to implement
  Really hard parts are already implemented in C* and one-nio
• Stateless is becoming obsolete
  There are new solutions to the problems it was aimed at

Effective and Reliable Microservices

Oleg Anastasyev


